library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(readxl)
library(tidyr)
library(plotly)
library(stringr)
library(tidyselect)
Your take-home final consists of three parts. The first part consists of some simple questions and their answers. These questions might include coding, brief comments or direct answers. The second part is about your group projects. You are asked to contribute to your project report with an additional analysis with two or three visualizations. The third part is about gathering real-life data and conducting analysis on it. Here are significant points that you should read carefully.
The purpose of this part is to gauge your apprehension about data manipulation, visualization and data science workflow in general. Most questions have no single correct answer, some don’t have good answers at all. It is possible to write many pages on the questions below but please keep it short. Constrain your answers to one or two paragraphs (7-8 lines tops).
What is your opinion about the AI hype? Do you think that, in 10 years, AI will solve many critical problems? What about the required human capital to build all the AI? If we consider a sufficient level a 100 what would be our level as Turkey / World? Why do you think so? (Give a number for both Turkey and the World and defend your scoring)
What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task to distribute funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on the society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you find an interesting angle?
In this part you are going to extend your group project with a single additional analysis supported by some visualization. You are tasked with finding the best improvement on the top of your group project. About one page is enough, two pages tops.
As all of you know well enough, real-life data is not readily available and it is messy. You will also face situations where you need to discover and learn another framework. In this part, you are going to gather data about organic agriculture production from the Ministry of Agriculture and Forestry. You should use (Organik Tarımsal Üretim Verileri) between 2014-2018 from https://www.tarimorman.gov.tr/Konular/Bitkisel-Uretim/Organik-Tarim/Istatistikler. Take some time to see what is offered in the data sets. Choose an interesting theme which can be analyzed with the given data and collect relevant data from the service. Some example themes can be as follows.
Tip: You can use readxl package for xlsx and xls files.
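As a minimal sketch of the tip above, the snippet below reads one sheet of an Excel workbook with readxl. The ministry files are not bundled here, so the example workbook that ships with readxl stands in for a downloaded file; the `skip = 0` argument is only a placeholder for files that carry title rows above the header.

```r
library(readxl)

# Stand-in for a downloaded ministry file: an example workbook bundled with readxl
path <- readxl_example("datasets.xlsx")

# List the sheets first, since one workbook may hold several tables
sheets <- excel_sheets(path)
print(sheets)

# Read the first sheet; raise `skip` if a real file has title rows above the header
df <- read_excel(path, sheet = sheets[1], skip = 0)
str(df)
```

For several yearly files, the same call can be applied to each path and the results combined with `dplyr::bind_rows()`, provided the column layouts match.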
1. I think that AI is a buzzword of the 21st century. Although it has a lot of potential to solve problems, it will probably give rise to new ones. Since AI is based on data analysis, reluctance to share personal data is an important brake on AI; much depends on the extent to which people and institutions are willing to share their data. Artificial intelligence also needs to make fair decisions. Because the field is very easy to manipulate, results can be distorted in the hands of the wrong people. Nevertheless, I think it will play an important role in solving many human problems.
In order to build artificial intelligence, it may be necessary to encourage people with different domain knowledge to learn what they can do with data analysis. This will support AI-based thinking. In addition, data collection on processes should be systematic and orderly.
Comparing Turkey with the world, we are behind the world average in how we view artificial intelligence and technology. If the world's current level is scored as 50, Turkey's would be around 7-8.
2. First of all, I try to understand the columns of the data. To do this, I look at descriptive statistics of the columns, such as their mean, median and distribution. The summary() function is useful for this purpose.
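The first look described above can be sketched as follows. This is only an illustration on the diamonds data set bundled with ggplot2, not a full workflow; the missing-value count is an extra step added here as an assumption about what a first pass usually includes.

```r
library(ggplot2)  # loaded only for the bundled diamonds data set

# Structure: column names, types, and a preview of the values
str(diamonds)

# Descriptive statistics (min, quartiles, mean, max, or level counts) per column
col_summary <- summary(diamonds)
print(col_summary)

# Count missing values per column before any further analysis
missing_per_column <- colSums(is.na(diamonds))
print(missing_per_column)
```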
3. I want to analyze whether the carat of a diamond and its price are related. The easiest way to see the relationship between two variables is to examine a scatter plot. I applied log() to the price to make the graph easier to read. Additionally, I wondered what effect the color of the diamond has on the price.
library(ggplot2)
ggplot(data = diamonds) +
geom_point(mapping = aes(x=carat, y = log(price), color = color)) +
coord_cartesian() +
theme_light() +
theme(plot.title = element_text(color = "cadetblue4", size=20)) +
ggtitle("Carat vs. Price") +
labs(x = "Carat", y = "log(Price)")
Results:
As can be seen above, carat plays a major role in a diamond's price. Also, at a given carat, diamonds with color D tend to be more expensive than diamonds with color J.
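The visual impression above can be backed with a quick numerical check. The sketch below is an added illustration, not part of the original analysis: it computes the Pearson correlation between carat and log(price), and the median price per carat for each color as a rough way of comparing colors at similar sizes.

```r
library(ggplot2)  # loaded only for the bundled diamonds data set

# Pearson correlation between carat and log(price)
r_carat_logprice <- cor(diamonds$carat, log(diamonds$price))
print(r_carat_logprice)

# Median price per carat for each color (D is the best grade, J the worst)
median_ppc_by_color <- tapply(diamonds$price / diamonds$carat,
                              diamonds$color, median)
print(median_ppc_by_color)
```

Price per carat is only a crude adjustment for size, so a regression of log(price) on carat and color would be the more careful follow-up.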
In our term project, we examined how interest rates affect economic indicators. Here, I wondered whether technology indicators are related to the consumer price index. The comparison below is research and development expenditure vs. consumer price index.
tech$date <-ymd(tech$date, truncated = 2L)
interest_rate$year <- as.Date(interest_rate$year)
k <- tech %>% select(-c(1,2)) %>%
filter(date > as.Date("2009-01-01")) %>%
filter(indicator_code == "GB.XPD.RSDV.GD.ZS") %>% #Research and development expenditure (% of GDP)
ggplot() +
geom_line(aes(date, value), color="sky blue") +
geom_point(aes(date, value), color="sky blue") +
theme_light() +
theme(plot.title = element_text(color = "cadetblue4", size=20)) +
ggtitle("Research and development expenditure (% of GDP)") +
scale_y_continuous("Expenditure (% of GDP)",
sec.axis = sec_axis(~ . * 325, name = "Average Consumer Price Index"))
filtered <- general_price_index %>%
mutate(Year = year(Date)) %>%
filter(index_type == "CPI") %>%
group_by(Year) %>%
summarize(average = mean(index_value)) %>%
filter(Year > 2009) %>%
filter(Year < 2018)
filtered$Year <- ymd(filtered$Year, truncated = 2L)
k + geom_line(data=filtered, aes(x=Year, y=average/325, group=1), color = "orange") + # divide by 325 so the values match the secondary axis (~ . * 325)
geom_point(data=filtered, aes(x=Year, y=average/325, group=1), color = "orange")
According to the graph above, both research and development expenditure (% of GDP) and the consumer price index have increased year by year. Unfortunately, there is not enough evidence that
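The dual-axis plot above relies on two hard-coded constants, 325 for the axis and 0.0031 for the data, whose product is about 1.0075 rather than exactly 1, so the orange series and its axis are slightly misaligned. A minimal sketch of deriving both from a single factor is shown below; the numbers in `rd` and `cpi` are made-up illustrative values, not the project data, and the names `rd`, `cpi` and `scale_factor` are hypothetical.

```r
library(ggplot2)

# Made-up illustrative values, NOT the project data
rd  <- data.frame(year = 2010:2017,
                  value = c(0.80, 0.80, 0.83, 0.82, 0.86, 0.88, 0.94, 0.96))
cpi <- data.frame(year = 2010:2017,
                  average = c(175, 186, 203, 218, 237, 256, 276, 306))

# One factor maps the secondary series onto the primary axis scale
scale_factor <- max(rd$value) / max(cpi$average)

p <- ggplot() +
  geom_line(data = rd, aes(year, value), color = "skyblue") +
  geom_line(data = cpi, aes(year, average * scale_factor), color = "orange") +
  # The secondary axis applies the exact inverse of the data transformation
  scale_y_continuous("Expenditure (% of GDP)",
                     sec.axis = sec_axis(~ . / scale_factor,
                                         name = "Average Consumer Price Index")) +
  theme_light()
p
```

Because the data transformation and the axis transformation are exact inverses, the tick labels on the right axis read back the original index values.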
a. Below, you can find information about my dataset for organic agriculture statistics.
finalExam_dataset has information about organic agriculture statistics, which contain the number of farmers, gross production and production area by year and city. You can download the data from link.
The Rdata file contains two dataframes: agriculture statistics with detailed product information, and total production information.
Organic agriculture dataframe with 17597 rows and 4 columns:
Total organic agriculture dataframe with 527 rows and 8 columns:
b. Project aim: Find the top 6 cities by organic production, the change in their production amount, the proportion gathered naturally, and the change in the production amount of popular products.
kumulatif_uretim <- total_organic_agriculture %>%
mutate(birim_uretim = Uretim_miktari / (Gercek_uretim_alani + Dogal_toplama_alani)) %>%
group_by(iller) %>%
summarise(kümüle_birim_uretim = sum(birim_uretim)) %>%
arrange(., desc(kümüle_birim_uretim))
cup <- kumulatif_uretim %>%
ggplot() +
geom_bar(aes(iller, kümüle_birim_uretim), stat = "identity", fill = "sky blue") +
theme_classic() +
labs(x="Cities", y="Cumulative Unit Production", title="Cumulative Unit Production in Turkey") +
theme(axis.text.x = element_text(angle=90))
cup <- ggplotly(cup)
cup
top_organic_cities <- factor(c("Kilis", "Konya", "Niğde", "Kars", "Yalova", "Bilecik")) # the top 6 cities
organic_agriculture$dogadan_toplama <- factor(organic_agriculture$dogadan_toplama)
total_organic_agriculture %>%
filter(iller %in% top_organic_cities) %>%
ggplot() +
geom_line(aes(yıl, log(Uretim_miktari))) +
facet_wrap(~ iller) + # one panel per city
theme_light() +
labs(x="Year", y="log(Production Amount)", title="Production Amount in Top 6 Organic Cities by year") +
theme(axis.text.x = element_text(angle=90))
g2 <- organic_agriculture %>%
filter(iller %in% top_organic_cities) %>%
group_by(iller, dogadan_toplama) %>%
summarise(production = sum(Uretim_miktari)) %>%
ggplot() +
geom_bar(mapping = aes(iller, production, fill=dogadan_toplama), stat = "identity", position = position_fill()) +
ylab("proportion") +
theme_light() +
labs(x="Cities", y="Proportion", title="Proportion of gathering naturally in top 6 organic cities")
g2 <- ggplotly(g2)
g2
top_product <- organic_agriculture %>%
group_by(Urun_adi) %>%
summarise(total_production = sum(Uretim_miktari)) %>%
arrange(desc(total_production)) %>%
head(n=10)
g3 <- organic_agriculture %>%
select(Urun_adi, Uretim_miktari, yıl) %>%
filter(Urun_adi %in% top_product$Urun_adi) %>%
group_by(Urun_adi, yıl) %>%
summarise(total_product = sum(Uretim_miktari)) %>%
ggplot() +
geom_line(aes(yıl, log(total_product), colour = Urun_adi)) +
theme_light() +
labs(x="Year", y="log(Production Amount)", title="Production Amount of Top Products on Yearly Basis")
g3 <- ggplotly(g3)
g3
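The top cities above are typed in by hand after inspecting the bar chart. A sketch of deriving them programmatically is given below. It assumes dplyr 1.0.0 or later for `slice_max()`, uses a made-up toy data frame named `toy` with the same column names as total_organic_agriculture, and adds a guard against rows whose total area is zero, which is an assumption about the data rather than something observed in it.

```r
library(dplyr)

# Made-up toy rows, NOT the ministry data; city names are placeholders
toy <- data.frame(
  iller               = c("A", "A", "B", "B", "C", "C"),
  Uretim_miktari      = c(10, 20, 5, 5, 40, 10),
  Gercek_uretim_alani = c(5, 10, 5, 0, 10, 5),
  Dogal_toplama_alani = c(0, 0, 0, 0, 0, 0)
)

top_cities <- toy %>%
  mutate(alan = Gercek_uretim_alani + Dogal_toplama_alani) %>%
  filter(alan > 0) %>%                       # avoid division by zero (Inf)
  mutate(birim_uretim = Uretim_miktari / alan) %>%
  group_by(iller) %>%
  summarise(kumule_birim_uretim = sum(birim_uretim), .groups = "drop") %>%
  slice_max(kumule_birim_uretim, n = 2) %>%  # n = 6 for the real data
  pull(iller)

print(top_cities)
```

The resulting character vector can be used directly in `filter(iller %in% top_cities)`, so the list of cities updates itself if the underlying data changes.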